Clustering based on Dissimilarity First Derivatives
Abstract
A hierarchical agglomerative clustering algorithm based on the analysis of dissimilarity increments between neighboring patterns is presented. The first derivative of dissimilarity between neighboring patterns inside a natural cluster is modelled by an exponential distribution, and this statistic characterizes the cluster. A cluster isolation criterion is defined from estimates of the mean dissimilarity increment of each cluster, which are updated continuously as clusters form within the hierarchical agglomerative framework. Unreliable estimates, which occur mainly when cluster cardinality is low, can lead to over-fragmentation of the data into spurious, small clusters. To prevent this, a regularizing function is proposed that widens the estimate of the exponential distribution mean when the number of samples is small. The method is analyzed in a comparative study with the well-known single-link and k-means algorithms. Application examples using both synthetic and real data show the ability of the method to identify arbitrarily shaped clusters.
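The abstract gives no formulas, so the following is only a minimal Python sketch of the general idea, not the authors' algorithm. It assumes that a "dissimilarity increment" can be approximated by the gap between a candidate merge distance and the last merge distance inside a cluster, and that a merge is refused when this gap exceeds a multiple of the cluster's widened mean increment. The names `widen` and `increment_clustering`, the threshold factor `alpha`, and the exponential form of the regularizer are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def widen(n_increments, strength=3.0, scale=5.0):
    # Hypothetical regularizer (assumed form, not from the paper): a factor
    # above 1 for clusters with few observed increments, decaying towards 1.
    return 1.0 + strength * math.exp(-n_increments / scale)


def increment_clustering(points, alpha=3.0):
    """Single-link style merging with an increment-based isolation test."""
    n = len(points)
    parent = list(range(n))   # union-find forest over the points
    last_d = [None] * n       # distance of the latest merge, per cluster root
    inc_sum = [0.0] * n       # sum of observed increments, per cluster root
    inc_n = [0] * n           # number of observed increments, per cluster root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Candidate merges in increasing order of pairwise distance.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    for d, i, j in edges:
        a, b = find(i), find(j)
        if a == b:
            continue
        isolated = False
        for c in (a, b):
            # Test only clusters that already have a usable mean estimate.
            if inc_n[c] > 0 and inc_sum[c] > 0.0:
                mean = inc_sum[c] / inc_n[c]
                if d - last_d[c] > alpha * widen(inc_n[c]) * mean:
                    isolated = True
        if isolated:
            continue
        # Accept the merge and update the running increment statistics.
        new = [d - last_d[c] for c in (a, b) if last_d[c] is not None]
        parent[b] = a
        inc_sum[a] += inc_sum[b] + sum(new)
        inc_n[a] += inc_n[b] + len(new)
        last_d[a] = d

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Calling `increment_clustering` on a list of coordinate tuples returns one integer label per point. In this sketch the widening factor makes the isolation test more permissive while a cluster has seen only a few increments, which mirrors the stated purpose of the regularizing function; clusters whose increments are all zero are never tested, which is a simplification.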
Similar resources
Clustering gene expression data using random forest dissimilarity
Background: The clustering of gene expression data plays an important role in the diagnosis and treatment of cancer. These kinds of data typically involve a large number of variables (genes) in comparison with the number of samples (patients). Many clustering methods have been built on the dissimilarity among observations, which is calculated by a distance function. As increa...
Composite Kernel Optimization in Semi-Supervised Metric
Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...
Improving the quality of text classification using a two-level classifier committee
Automated text classification has gained particular importance owing to the increasing availability of documents in digital form and the ensuing need to organize them. Although this problem belongs to the Information Retrieval (IR) field, the dominant approach is based on machine learning techniques. Approaches based on classifier committees have shown better performance than the others. I...
On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering
Hybrid clustering combines partitional and hierarchical clustering for computational effectiveness and versatility in cluster shape. In such clustering, a dissimilarity measure plays a crucial role in the hierarchical merging. The dissimilarity measure has great impact on the final clustering, and data-independent properties are needed to choose the right dissimilarity measure for the problem a...
Clustering with Intelligent Linex k-Means
The intelligent LINEX k-means clustering is a generalization of k-means clustering in which the number of clusters and their centroids can be determined while the LINEX loss function is used as the dissimilarity measure. Therefore, the selection of the centers in each cluster is not random. Choosing the LINEX dissimilarity measure helps the researcher to overestimate or undere...